Game of Thrones - Battle Prediction

Game of Thrones is a popular fantasy TV show based on a series of books written by George R. R. Martin.

This notebook analyzes the battles in the series and prepares the data for predicting their outcomes.

Motivation

Load packages

In [1]:
%load_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import clear_output
from string import Template
from openpyxl import load_workbook
import statsmodels.api as sm
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from pylab import rcParams
from collections import Counter
from time import time
from pandas_profiling import ProfileReport
from IPython.display import display
# Import supplementary visualization code visuals.py
import visuals as vs
rcParams['figure.figsize'] = 5, 6
plt.style.use('ggplot')
In [2]:
# @hidden_cell
import warnings
warnings.filterwarnings("ignore")

Load dataset

In [3]:
battles_df = pd.read_csv('../../data/battles.csv')
battles_df.head()
Out[3]:
name year battle_number attacker_king defender_king attacker_1 attacker_2 attacker_3 attacker_4 defender_1 ... major_death major_capture attacker_size defender_size attacker_commander defender_commander summer location region note
0 Battle of Winterfell 299 12 Balon/Euron Greyjoy Robb Stark Greyjoy NaN NaN NaN Stark ... 0.0 1.0 20.0 NaN Theon Greyjoy Bran Stark 1.0 Winterfell The North It isn't mentioned how many Stark men are left...
1 Sack of Harrenhal 299 18 Robb Stark Joffrey/Tommen Baratheon Stark NaN NaN NaN Lannister ... 1.0 0.0 100.0 100.0 Roose Bolton, Vargo Hoat, Robett Glover Amory Lorch 1.0 Harrenhal The Riverlands NaN
2 Battle of Torrhen's Square 299 11 Robb Stark Balon/Euron Greyjoy Stark NaN NaN NaN Greyjoy ... 0.0 0.0 244.0 900.0 Rodrik Cassel, Cley Cerwyn Dagmer Cleftjaw 1.0 Torrhen's Square The North Greyjoy's troop number comes from the 264 esti...
3 Battle of the Stony Shore 299 10 Balon/Euron Greyjoy Robb Stark Greyjoy NaN NaN NaN Stark ... 0.0 0.0 264.0 NaN Theon Greyjoy NaN 1.0 Stony Shore The North Greyjoy's troop number based on the Battle of ...
4 Sack of Winterfell 299 14 Joffrey/Tommen Baratheon Robb Stark Bolton Greyjoy NaN NaN Stark ... 1.0 0.0 618.0 2000.0 Ramsay Snow, Theon Greyjoy Rodrik Cassel, Cley Cerwyn, Leobald Tallhart 1.0 Winterfell The North Since House Bolton betrays the Starks for Hous...

5 rows × 25 columns

While reviewing other kernels to see what had been done, I found one on Kaggle that pointed out a data-entry mistake in the Battle of Castle Black. Having watched the TV series, I know that Mance Rayder has about 100K wildlings and Stannis Baratheon has 1,240 troops, so I swapped the two kings' names in the dataset. Anyone using this dataset should be aware of this correction.
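The swap described above can be sketched as follows. This is a minimal illustration, not the exact edit made to the CSV: the helper `swap_kings` is hypothetical, the toy frame stands in for battles.csv, and the battle name string is an assumption about how the row is labelled in the file.

```python
import pandas as pd

def swap_kings(df, battle_name):
    """Return a copy of df with attacker_king and defender_king swapped for one battle."""
    out = df.copy()
    mask = out["name"] == battle_name
    # Assign the two columns in reversed order for the matching rows only
    out.loc[mask, ["attacker_king", "defender_king"]] = (
        out.loc[mask, ["defender_king", "attacker_king"]].values
    )
    return out

# Toy frame standing in for battles.csv (assumed layout, illustrative rows)
toy = pd.DataFrame({
    "name": ["Battle of Castle Black", "Battle of Winterfell"],
    "attacker_king": ["Stannis Baratheon", "Balon/Euron Greyjoy"],
    "defender_king": ["Mance Rayder", "Robb Stark"],
})
fixed = swap_kings(toy, "Battle of Castle Black")
print(fixed[["name", "attacker_king", "defender_king"]])
```

Doing the correction in code rather than by hand keeps the original file untouched and makes the change visible to anyone rerunning the notebook.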

Data Cleaning

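Map attacker_outcome to a 0/1 flag and fill NaN with zeros for major_death, major_capture, and summer. The flag's NaN is also filled with zero, so the one battle with no recorded outcome (row 14) is counted as a loss.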

In [4]:
battles_df['attacker_outcome_flag'] = battles_df['attacker_outcome'].map({'win': 1, 'loss': 0})

battles_df['attacker_outcome_flag'] = battles_df['attacker_outcome_flag'].fillna(0)
battles_df['major_death'] = battles_df['major_death'].fillna(0)
battles_df['major_capture'] = battles_df['major_capture'].fillna(0)
battles_df['summer'] = battles_df['summer'].fillna(0)
battles_df[['attacker_outcome','major_death','major_capture','summer']]
Out[4]:
attacker_outcome major_death major_capture summer
0 win 0.0 1.0 1.0
1 win 1.0 0.0 1.0
2 win 0.0 0.0 1.0
3 win 0.0 0.0 1.0
4 win 1.0 0.0 1.0
5 win 0.0 0.0 1.0
6 win 0.0 1.0 0.0
7 win 1.0 1.0 1.0
8 win 0.0 0.0 0.0
9 win 0.0 0.0 0.0
10 loss 1.0 0.0 1.0
11 win 1.0 1.0 1.0
12 win 0.0 0.0 0.0
13 win 1.0 0.0 1.0
14 NaN 0.0 0.0 0.0
15 win 0.0 0.0 1.0
16 win 0.0 0.0 1.0
17 win 1.0 1.0 1.0
18 win 1.0 0.0 1.0
19 win 0.0 1.0 1.0
20 loss 1.0 1.0 1.0
21 loss 0.0 0.0 1.0
22 loss 1.0 1.0 1.0
23 loss 1.0 1.0 0.0
24 win 1.0 0.0 1.0
25 win 0.0 0.0 0.0
26 win 0.0 1.0 1.0
27 win 0.0 0.0 1.0
28 win 0.0 0.0 0.0
29 win 1.0 0.0 1.0
30 win 0.0 1.0 1.0
31 win 0.0 0.0 0.0
32 win 0.0 0.0 0.0
33 win 0.0 0.0 1.0
34 win 0.0 0.0 0.0
35 win 0.0 0.0 0.0
36 win 0.0 0.0 1.0
37 win 0.0 0.0 1.0
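Filling the flag with zero means a missing outcome is indistinguishable from a loss. If that distinction matters, one option is pandas' nullable integer dtype. The snippet below is a sketch on a toy Series, not a change to the notebook's pipeline.

```python
import pandas as pd

# Toy outcome column with one missing value (illustrative only)
outcome = pd.Series(["win", "loss", None, "win"], name="attacker_outcome")

# Same mapping as the notebook, but stored as a nullable integer so that
# a missing outcome stays missing instead of becoming 0 (a loss).
flag = outcome.map({"win": 1, "loss": 0}).astype("Int64")

print(flag)
print("known outcomes:", int(flag.notna().sum()))
print("attacker win rate among known outcomes:", float(flag.mean()))
```

The mean skips the missing entry, so the win rate is computed over known outcomes only.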

Run pandas profiler for fast EDA

In [5]:
profile = ProfileReport(battles_df, title='Game of Thrones Battles - Pandas Profiling Report', style={'full_width':True})
profile
Out[5]:

In [6]:
profile.to_file(output_file="got_battles_data_profile.html")

The columns attacker_1 to attacker_4 and defender_1 to defender_4 can be used to count the number of houses on each side of a battle. The columns attacker_commander and defender_commander can be used to count the commanders on each side.

Five columns will be created:

  • Number of attacking houses
  • Number of defending houses
  • Number of attacker_commander
  • Number of defender_commander
  • Battle Size
In [7]:
battles_df['attack_houses'] = battles_df[['attacker_1','attacker_2','attacker_3','attacker_4']].notnull().sum(axis=1)
battles_df['attack_houses'] = pd.to_numeric(battles_df.attack_houses)

battles_df['defender_houses'] = battles_df[['defender_1','defender_2','defender_3','defender_4']].notnull().sum(axis=1)
battles_df['defender_houses'] = pd.to_numeric(battles_df.defender_houses)

# Check data
battles_df[['attacker_1','attacker_2','attacker_3','attacker_4','attack_houses','defender_1','defender_2','defender_3','defender_4','defender_houses']].sort_values(by=['attack_houses','defender_houses'],ascending=[False,False])
Out[7]:
attacker_1 attacker_2 attacker_3 attacker_4 attack_houses defender_1 defender_2 defender_3 defender_4 defender_houses
14 Baratheon Karstark Mormont Glover 4 Bolton Frey NaN NaN 2
12 Baratheon Karstark Mormont Glover 4 Greyjoy NaN NaN NaN 1
23 Free folk Thenns Giants NaN 3 Night's Watch Baratheon NaN NaN 2
4 Bolton Greyjoy NaN NaN 2 Stark NaN NaN NaN 1
6 Bracken Lannister NaN NaN 2 Blackwood NaN NaN NaN 1
7 Stark Tully NaN NaN 2 Lannister NaN NaN NaN 1
9 Lannister Frey NaN NaN 2 Tully NaN NaN NaN 1
11 Frey Bolton NaN NaN 2 Stark NaN NaN NaN 1
15 Stark Tully NaN NaN 2 Lannister NaN NaN NaN 1
17 Stark Tully NaN NaN 2 Lannister NaN NaN NaN 1
0 Greyjoy NaN NaN NaN 1 Stark NaN NaN NaN 1
1 Stark NaN NaN NaN 1 Lannister NaN NaN NaN 1
2 Stark NaN NaN NaN 1 Greyjoy NaN NaN NaN 1
3 Greyjoy NaN NaN NaN 1 Stark NaN NaN NaN 1
5 Greyjoy NaN NaN NaN 1 Stark NaN NaN NaN 1
8 Baratheon NaN NaN NaN 1 Baratheon NaN NaN NaN 1
10 Stark NaN NaN NaN 1 Lannister NaN NaN NaN 1
13 Baratheon NaN NaN NaN 1 Baratheon NaN NaN NaN 1
16 Stark NaN NaN NaN 1 Lannister NaN NaN NaN 1
18 Lannister NaN NaN NaN 1 Tully NaN NaN NaN 1
19 Lannister NaN NaN NaN 1 Tully NaN NaN NaN 1
20 Stark NaN NaN NaN 1 Lannister NaN NaN NaN 1
21 Lannister NaN NaN NaN 1 Tully NaN NaN NaN 1
22 Baratheon NaN NaN NaN 1 Lannister NaN NaN NaN 1
24 Lannister NaN NaN NaN 1 Baratheon NaN NaN NaN 1
25 Baratheon NaN NaN NaN 1 Baratheon NaN NaN NaN 1
26 Frey NaN NaN NaN 1 Mallister NaN NaN NaN 1
27 Lannister NaN NaN NaN 1 Darry NaN NaN NaN 1
28 Lannister NaN NaN NaN 1 Stark NaN NaN NaN 1
29 Lannister NaN NaN NaN 1 Brave Companions NaN NaN NaN 1
30 Greyjoy NaN NaN NaN 1 Stark NaN NaN NaN 1
31 Greyjoy NaN NaN NaN 1 Tyrell NaN NaN NaN 1
32 Greyjoy NaN NaN NaN 1 Tyrell NaN NaN NaN 1
33 Darry NaN NaN NaN 1 Lannister NaN NaN NaN 1
34 Bolton NaN NaN NaN 1 Greyjoy NaN NaN NaN 1
36 Greyjoy NaN NaN NaN 1 Stark NaN NaN NaN 1
37 Brotherhood without Banners NaN NaN NaN 1 Brave Companions NaN NaN NaN 1
35 Brave Companions NaN NaN NaN 1 NaN NaN NaN NaN 0

Count occurrences of attacker_commander and defender_commander

In [8]:
battles_df['attacker_commander'].str.split(',', expand=True)
Out[8]:
0 1 2 3 4 5
0 Theon Greyjoy None None None None None
1 Roose Bolton Vargo Hoat Robett Glover None None None
2 Rodrik Cassel Cley Cerwyn None None None None
3 Theon Greyjoy None None None None None
4 Ramsay Snow Theon Greyjoy None None None None
5 Asha Greyjoy None None None None None
6 Jonos Bracken Jaime Lannister None None None None
7 Robb Stark Brynden Tully None None None None
8 Loras Tyrell Raxter Redwyne None None None None
9 Daven Lannister Ryman Fey Jaime Lannister None None None
10 Robertt Glover Helman Tallhart None None None None
11 Walder Frey Roose Bolton Walder Rivers None None None
12 Stannis Baratheon Alysane Mormot None None None None
13 Stannis Baratheon Davos Seaworth None None None None
14 Stannis Baratheon None None None None None
15 Robb Stark Tytos Blackwood Brynden Tully None None None
16 Robb Stark Smalljon Umber Black Walder Frey None None None
17 Robb Stark Brynden Tully None None None None
18 Jaime Lannister None None None None None
19 Jaime Lannister Andros Brax None None None None
20 Roose Bolton Wylis Manderly Medger Cerwyn Harrion Karstark Halys Hornwood None
21 Tywin Lannister Flement Brax Gregor Clegane Addam Marbrand Lyle Crakehall Leo Lefford
22 Stannis Baratheon Imry Florent Guyard Morrigen Rolland Storm Salladhor Saan Davos Seaworth
23 Mance Rayder Tormund Giantsbane Harma Dogshead Magnar Styr Varamyr None
24 Gregor Clegane None None None None None
25 Mace Tyrell Mathis Rowan None None None None
26 Walder Frey None None None None None
27 Gregor Clegane None None None None None
28 Gregor Clegane None None None None None
29 Gregor Clegane None None None None None
30 Dagmer Cleftjaw None None None None None
31 Euron Greyjoy Victarion Greyjoy None None None None
32 Euron Greyjoy Victarion Greyjoy None None None None
33 Helman Tallhart None None None None None
34 Ramsey Bolton None None None None None
35 Rorge None None None None None
36 Victarion Greyjoy None None None None None
37 NaN NaN NaN NaN NaN NaN
In [9]:
battles_df['attacker_commander_count'] = battles_df['attacker_commander'].str.split(',', expand=True).notnull().sum(axis=1)
battles_df[['attacker_commander','attacker_commander_count']]
Out[9]:
attacker_commander attacker_commander_count
0 Theon Greyjoy 1
1 Roose Bolton, Vargo Hoat, Robett Glover 3
2 Rodrik Cassel, Cley Cerwyn 2
3 Theon Greyjoy 1
4 Ramsay Snow, Theon Greyjoy 2
5 Asha Greyjoy 1
6 Jonos Bracken, Jaime Lannister 2
7 Robb Stark, Brynden Tully 2
8 Loras Tyrell, Raxter Redwyne 2
9 Daven Lannister, Ryman Fey, Jaime Lannister 3
10 Robertt Glover, Helman Tallhart 2
11 Walder Frey, Roose Bolton, Walder Rivers 3
12 Stannis Baratheon, Alysane Mormot 2
13 Stannis Baratheon, Davos Seaworth 2
14 Stannis Baratheon 1
15 Robb Stark, Tytos Blackwood, Brynden Tully 3
16 Robb Stark, Smalljon Umber, Black Walder Frey 3
17 Robb Stark, Brynden Tully 2
18 Jaime Lannister 1
19 Jaime Lannister, Andros Brax 2
20 Roose Bolton, Wylis Manderly, Medger Cerwyn, H... 5
21 Tywin Lannister, Flement Brax, Gregor Clegane,... 6
22 Stannis Baratheon, Imry Florent, Guyard Morrig... 6
23 Mance Rayder, Tormund Giantsbane, Harma Dogshe... 5
24 Gregor Clegane 1
25 Mace Tyrell, Mathis Rowan 2
26 Walder Frey 1
27 Gregor Clegane 1
28 Gregor Clegane 1
29 Gregor Clegane 1
30 Dagmer Cleftjaw 1
31 Euron Greyjoy, Victarion Greyjoy 2
32 Euron Greyjoy, Victarion Greyjoy 2
33 Helman Tallhart 1
34 Ramsey Bolton 1
35 Rorge 1
36 Victarion Greyjoy 1
37 NaN 0
In [10]:
battles_df['defender_commander'].str.split(',', expand=True)
Out[10]:
0 1 2 3 4 5 6
0 Bran Stark None None None None None None
1 Amory Lorch None None None None None None
2 Dagmer Cleftjaw None None None None None None
3 NaN NaN NaN NaN NaN NaN NaN
4 Rodrik Cassel Cley Cerwyn Leobald Tallhart None None None None
5 NaN NaN NaN NaN NaN NaN NaN
6 Tytos Blackwood None None None None None None
7 Jaime Lannister None None None None None None
8 Rolland Storm None None None None None None
9 Brynden Tully None None None None None None
10 Randyll Tarly Gregor Clegane None None None None None
11 Robb Stark None None None None None None
12 Asha Greyjoy None None None None None None
13 Renly Baratheon Cortnay Penrose Loras Tyrell Randyll Tarly Mathis Rowan None None
14 Roose Bolton None None None None None None
15 Lord Andros Brax Forley Prester None None None None None
16 Rolph Spicer None None None None None None
17 Stafford Lannister Roland Crakehall Antario Jast None None None None
18 Clement Piper Vance None None None None None
19 Edmure Tully Tytos Blackwood None None None None None
20 Tywin Lannister Gregor Clegane Kevan Lannister Addam Marbrand None None None
21 Edmure Tully Jason Mallister Karyl Vance None None None None
22 Tyrion Lannister Jacelyn Bywater Sandor Clegane Tywin Lannister Garlan Tyrell Mace Tyrell Randyll Tarly
23 Stannis Baratheon Jon Snow Donal Noye Cotter Pyke None None None
24 Beric Dondarrion None None None None None None
25 Gilbert Farring None None None None None None
26 Jason Mallister None None None None None None
27 Lyman Darry None None None None None None
28 Roose Bolton Wylis Manderly None None None None None
29 Vargo Hoat None None None None None None
30 NaN NaN NaN NaN NaN NaN NaN
31 NaN NaN NaN NaN NaN NaN NaN
32 NaN NaN NaN NaN NaN NaN NaN
33 NaN NaN NaN NaN NaN NaN NaN
34 NaN NaN NaN NaN NaN NaN NaN
35 NaN NaN NaN NaN NaN NaN NaN
36 NaN NaN NaN NaN NaN NaN NaN
37 NaN NaN NaN NaN NaN NaN NaN
In [11]:
battles_df['defender_commander_count'] = battles_df['defender_commander'].str.split(',', expand=True).notnull().sum(axis=1)
battles_df[['defender_commander','defender_commander_count']]
Out[11]:
defender_commander defender_commander_count
0 Bran Stark 1
1 Amory Lorch 1
2 Dagmer Cleftjaw 1
3 NaN 0
4 Rodrik Cassel, Cley Cerwyn, Leobald Tallhart 3
5 NaN 0
6 Tytos Blackwood 1
7 Jaime Lannister 1
8 Rolland Storm 1
9 Brynden Tully 1
10 Randyll Tarly, Gregor Clegane 2
11 Robb Stark 1
12 Asha Greyjoy 1
13 Renly Baratheon, Cortnay Penrose, Loras Tyrell... 5
14 Roose Bolton 1
15 Lord Andros Brax, Forley Prester 2
16 Rolph Spicer 1
17 Stafford Lannister, Roland Crakehall, Antario ... 3
18 Clement Piper, Vance 2
19 Edmure Tully, Tytos Blackwood 2
20 Tywin Lannister, Gregor Clegane, Kevan Lannist... 4
21 Edmure Tully, Jason Mallister, Karyl Vance 3
22 Tyrion Lannister, Jacelyn Bywater, Sandor Cleg... 7
23 Stannis Baratheon, Jon Snow, Donal Noye, Cotte... 4
24 Beric Dondarrion 1
25 Gilbert Farring 1
26 Jason Mallister 1
27 Lyman Darry 1
28 Roose Bolton, Wylis Manderly 2
29 Vargo Hoat 1
30 NaN 0
31 NaN 0
32 NaN 0
33 NaN 0
34 NaN 0
35 NaN 0
36 NaN 0
37 NaN 0
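The commander counts above expand each cell into a wide frame and count the non-null cells. An equivalent sketch that avoids the intermediate frame takes the length of each split list. The sample values are copied from the output above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

commanders = pd.Series([
    "Theon Greyjoy",
    "Roose Bolton, Vargo Hoat, Robett Glover",
    np.nan,
])

# Split each cell into a list, take its length, and treat missing cells as 0.
counts = commanders.str.split(",").str.len().fillna(0).astype(int)
print(counts.tolist())  # [1, 3, 0]
```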

Drop columns that are mostly missing or no longer needed

In [12]:
battles_df = battles_df.drop(columns = ['battle_number','attacker_2','attacker_3','attacker_4','defender_2','defender_3','defender_4','note'])
battles_df.head()
Out[12]:
name year attacker_king defender_king attacker_1 defender_1 attacker_outcome battle_type major_death major_capture ... attacker_commander defender_commander summer location region attacker_outcome_flag attack_houses defender_houses attacker_commander_count defender_commander_count
0 Battle of Winterfell 299 Balon/Euron Greyjoy Robb Stark Greyjoy Stark win ambush 0.0 1.0 ... Theon Greyjoy Bran Stark 1.0 Winterfell The North 1.0 1 1 1 1
1 Sack of Harrenhal 299 Robb Stark Joffrey/Tommen Baratheon Stark Lannister win ambush 1.0 0.0 ... Roose Bolton, Vargo Hoat, Robett Glover Amory Lorch 1.0 Harrenhal The Riverlands 1.0 1 1 3 1
2 Battle of Torrhen's Square 299 Robb Stark Balon/Euron Greyjoy Stark Greyjoy win pitched battle 0.0 0.0 ... Rodrik Cassel, Cley Cerwyn Dagmer Cleftjaw 1.0 Torrhen's Square The North 1.0 1 1 2 1
3 Battle of the Stony Shore 299 Balon/Euron Greyjoy Robb Stark Greyjoy Stark win ambush 0.0 0.0 ... Theon Greyjoy NaN 1.0 Stony Shore The North 1.0 1 1 1 0
4 Sack of Winterfell 299 Joffrey/Tommen Baratheon Robb Stark Bolton Stark win ambush 1.0 0.0 ... Ramsay Snow, Theon Greyjoy Rodrik Cassel, Cley Cerwyn, Leobald Tallhart 1.0 Winterfell The North 1.0 2 1 2 3

5 rows × 22 columns

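Create battle_size for the total number of people involved in a battle. Because it is a plain sum, battle_size is NaN whenever either attacker_size or defender_size is missing.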

In [13]:
battles_df['battle_size'] = battles_df['attacker_size'] + battles_df['defender_size']
battles_df[['attacker_size','defender_size','battle_size']]
Out[13]:
attacker_size defender_size battle_size
0 20.0 NaN NaN
1 100.0 100.0 200.0
2 244.0 900.0 1144.0
3 264.0 NaN NaN
4 618.0 2000.0 2618.0
5 1000.0 NaN NaN
6 1500.0 NaN NaN
7 1875.0 6000.0 7875.0
8 2000.0 NaN NaN
9 3000.0 NaN NaN
10 3000.0 NaN NaN
11 3500.0 3500.0 7000.0
12 4500.0 200.0 4700.0
13 5000.0 20000.0 25000.0
14 5000.0 8000.0 13000.0
15 6000.0 12625.0 18625.0
16 6000.0 NaN NaN
17 6000.0 10000.0 16000.0
18 15000.0 4000.0 19000.0
19 15000.0 10000.0 25000.0
20 18000.0 20000.0 38000.0
21 20000.0 10000.0 30000.0
22 21000.0 7250.0 28250.0
23 100000.0 1240.0 101240.0
24 NaN 120.0 NaN
25 NaN 200.0 NaN
26 NaN NaN NaN
27 NaN NaN NaN
28 NaN 6000.0 NaN
29 NaN NaN NaN
30 NaN NaN NaN
31 NaN NaN NaN
32 NaN NaN NaN
33 NaN NaN NaN
34 NaN NaN NaN
35 NaN NaN NaN
36 NaN NaN NaN
37 NaN NaN NaN
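The table shows that plain addition loses every battle where only one side's size is known. As a sketch of an alternative, `DataFrame.sum` with `min_count=1` keeps the known side and returns NaN only when both are missing. The toy values below mirror a few rows of the output above; whether a one-sided total is meaningful is a modelling choice.

```python
import numpy as np
import pandas as pd

sizes = pd.DataFrame({
    "attacker_size": [20.0, 100.0, np.nan, np.nan],
    "defender_size": [np.nan, 100.0, 120.0, np.nan],
})

# Plain addition: NaN if either side is missing (the behaviour shown above).
strict = sizes["attacker_size"] + sizes["defender_size"]

# Row-wise sum that needs at least one known side; NaN only if both are missing.
lenient = sizes[["attacker_size", "defender_size"]].sum(axis=1, min_count=1)

print(pd.DataFrame({"strict": strict, "lenient": lenient}))
```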

Plot correlation

In [14]:
corr_plot = battles_df.corr(method='pearson').style.set_caption('Correlation for Game of Thrones Battles').background_gradient(cmap='coolwarm').set_precision(4)
corr_plot
Out[14]:
Correlation for Game of Thrones Battles
year major_death major_capture attacker_size defender_size summer attacker_outcome_flag attack_houses defender_houses attacker_commander_count defender_commander_count battle_size
year 1 -0.3563 -0.1841 0.1558 -0.366 -0.8105 -0.03909 0.3188 0.1238 -0.005841 -0.2217 0.2017
major_death -0.3563 1 0.2736 0.2726 0.06158 0.3706 -0.2962 0.06181 0.1305 0.3308 0.5853 0.1881
major_capture -0.1841 0.2736 1 0.3355 0.234 0.184 -0.201 0.1234 0.1474 0.3088 0.3435 0.4066
attacker_size 0.1558 0.2726 0.3355 1 -0.1121 -0.2731 -0.5202 0.2364 0.6462 0.5371 0.4528 0.9652
defender_size -0.366 0.06158 0.234 -0.1121 1 0.3255 -0.2831 -0.199 -0.1024 0.258 0.5157 0.1515
summer -0.8105 0.3706 0.184 -0.2731 0.3255 1 0.01634 -0.3823 -0.1385 0.02564 0.1982 -0.38
attacker_outcome_flag -0.03909 -0.2962 -0.201 -0.5202 -0.2831 0.01634 1 -0.2437 -0.4752 -0.6564 -0.5795 -0.6039
attack_houses 0.3188 0.06181 0.1234 0.2364 -0.199 -0.3823 -0.2437 1 0.5559 0.1262 0.09444 0.08775
defender_houses 0.1238 0.1305 0.1474 0.6462 -0.1024 -0.1385 -0.4752 0.5559 1 0.1988 0.2179 0.5808
attacker_commander_count -0.005841 0.3308 0.3088 0.5371 0.258 0.02564 -0.6564 0.1262 0.1988 1 0.7026 0.5468
defender_commander_count -0.2217 0.5853 0.3435 0.4528 0.5157 0.1982 -0.5795 0.09444 0.2179 0.7026 1 0.5028
battle_size 0.2017 0.1881 0.4066 0.9652 0.1515 -0.38 -0.6039 0.08775 0.5808 0.5468 0.5028 1
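To read the matrix with the prediction target in mind, it helps to rank the features by the strength of their correlation with attacker_outcome_flag. The sketch below copies that row from the table above; in the notebook the same Series could be taken from the `attacker_outcome_flag` column of `battles_df.corr()`. With only 38 battles and many missing sizes, these values should be read as rough indications.

```python
import pandas as pd

# attacker_outcome_flag row copied from the correlation table above
corr_with_outcome = pd.Series({
    "year": -0.03909, "major_death": -0.2962, "major_capture": -0.201,
    "attacker_size": -0.5202, "defender_size": -0.2831, "summer": 0.01634,
    "attack_houses": -0.2437, "defender_houses": -0.4752,
    "attacker_commander_count": -0.6564, "defender_commander_count": -0.5795,
    "battle_size": -0.6039,
})

# Order by absolute correlation, strongest first, keeping the original signs
ranked = corr_with_outcome.reindex(
    corr_with_outcome.abs().sort_values(ascending=False).index
)
print(ranked)
```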

Run profiler again

In [15]:
profile = ProfileReport(battles_df, title='Game of Thrones Battles - Pandas Profiling Report', style={'full_width':True})
profile
profile.to_file(output_file="got_battles_data_profile.html")

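Write a function to clean the battles dataset. Unlike the cells above, it also fills missing attacker_size and defender_size with zeros, so battle_size is never NaN.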

In [16]:
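def clean_battle_data(df):
    """Return a cleaned copy of the battles data with the derived columns added."""
    # Work on a copy so the caller's DataFrame is left untouched
    df = df.copy()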
    df['attacker_outcome_flag'] = df['attacker_outcome'].map({'win': 1, 'loss': 0})

    # Fill NaN with zero
    df['attacker_outcome_flag'] = df['attacker_outcome_flag'].fillna(0)
    df['major_death'] = df['major_death'].fillna(0)
    df['major_capture'] = df['major_capture'].fillna(0)
    df['summer'] = df['summer'].fillna(0)
    df['attacker_size'] = df['attacker_size'].fillna(0)
    df['defender_size'] = df['defender_size'].fillna(0)

    # Count the houses on each side from attacker_1..attacker_4 and defender_1..defender_4
    df['attack_houses'] = df[['attacker_1','attacker_2','attacker_3','attacker_4']].notnull().sum(axis=1)
    df['attack_houses'] = pd.to_numeric(df.attack_houses)

    df['defender_houses'] = df[['defender_1','defender_2','defender_3','defender_4']].notnull().sum(axis=1)
    df['defender_houses'] = pd.to_numeric(df.defender_houses)

    # Count attacker_commander
    df['attacker_commander_count'] = df['attacker_commander'].str.split(',', expand=True).notnull().sum(axis=1)

    # Count defender_commander
    df['defender_commander_count'] = df['defender_commander'].str.split(',', expand=True).notnull().sum(axis=1)

    # Drop columns with missing data
    df = df.drop(columns = ['battle_number','attacker_2','attacker_3','attacker_4','defender_2','defender_3','defender_4','note'])

    # Create battle_size columns
    df['battle_size'] = df['attacker_size'] + df['defender_size']
    df['battle_size'] = df['battle_size'].fillna(0)

    return df

Test function

In [17]:
battles_df = pd.read_csv('../../data/battles.csv')
battles_df = clean_battle_data(battles_df)
battles_df.head()
Out[17]:
name year attacker_king defender_king attacker_1 defender_1 attacker_outcome battle_type major_death major_capture ... defender_commander summer location region attacker_outcome_flag attack_houses defender_houses attacker_commander_count defender_commander_count battle_size
0 Battle of Winterfell 299 Balon/Euron Greyjoy Robb Stark Greyjoy Stark win ambush 0.0 1.0 ... Bran Stark 1.0 Winterfell The North 1.0 1 1 1 1 20.0
1 Sack of Harrenhal 299 Robb Stark Joffrey/Tommen Baratheon Stark Lannister win ambush 1.0 0.0 ... Amory Lorch 1.0 Harrenhal The Riverlands 1.0 1 1 3 1 200.0
2 Battle of Torrhen's Square 299 Robb Stark Balon/Euron Greyjoy Stark Greyjoy win pitched battle 0.0 0.0 ... Dagmer Cleftjaw 1.0 Torrhen's Square The North 1.0 1 1 2 1 1144.0
3 Battle of the Stony Shore 299 Balon/Euron Greyjoy Robb Stark Greyjoy Stark win ambush 0.0 0.0 ... NaN 1.0 Stony Shore The North 1.0 1 1 1 0 264.0
4 Sack of Winterfell 299 Joffrey/Tommen Baratheon Robb Stark Bolton Stark win ambush 1.0 0.0 ... Rodrik Cassel, Cley Cerwyn, Leobald Tallhart 1.0 Winterfell The North 1.0 2 1 2 3 2618.0

5 rows × 23 columns

Export clean data

In [18]:
battles_df.to_csv('../../data/battles_clean.csv', index = False)

Exploratory Data Analysis

In [19]:
profile
Out[19]:

Using the Fast EDA (one dimension):

attacker_outcome
The attacker won 32 of the 38 battles (84.2%).

attacker_king
On the offense, Joffrey/Tommen Baratheon were the attacking kings in 36.8% of the battles (14 battles), while Mance Rayder was the attacking king only once (2.6%).

defender_king
On the defense, Robb Stark was attacked most often (36.8% or 14 battles), with Joffrey/Tommen Baratheon second (34.2% or 13 battles). Renly Baratheon defended only once (2.6%).

battle_type
The most common battle_type is pitched battle, accounting for 36.8% of the battles (14), while razing accounts for only 5.3% (2).

region
Most of the battles were fought in The Riverlands (44.7% or 17 battles), followed by The North (26.3% or 10 battles). There was only one battle Beyond the Wall (2.6%).

summer
Most of the battles were fought in the summer (68.4% or 26 battles), while the remaining battles were fought in the winter (31.6% or 12 battles).

year
The majority of the battles were fought in the year 299 (52.6% or 20 battles), followed by the year 300 (28.9% or 11 battles). The year 298 had only 7 battles (18.4%).
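The counts and percentages quoted in this section come from the profiling report, but they can also be reproduced with `value_counts`. The sketch below rebuilds the year column from the counts stated above (7, 20 and 11 battles) rather than loading the CSV; in the notebook, `battles_df['year']` would take its place.

```python
import pandas as pd

# Year column rebuilt from the counts quoted above (7, 20 and 11 battles).
year = pd.Series([298] * 7 + [299] * 20 + [300] * 11, name="year")

summary = pd.DataFrame({
    "battles": year.value_counts(),
    "share_pct": (year.value_counts(normalize=True) * 100).round(1),
}).sort_index()
print(summary)
```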

Multiple dimensional view

We will examine combinations of variables to see how they relate to the attacker's outcome.

  • attacker_king vs attacker_outcome
  • defender_king vs attacker_outcome
  • battle_type vs attacker_outcome
  • summer vs battle_type vs attacker_outcome
  • battle_type vs attacker_king vs attacker_outcome
  • battle_type vs defender_king vs attacker_outcome
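Each view in the list above uses the same group-and-summarize pattern: count the battles, sum the wins, then derive win and loss percentages. A small helper can capture that pattern once. `summarize_outcomes` is a hypothetical name and the toy data is illustrative only; the cells that follow spell the steps out explicitly instead.

```python
import pandas as pd

def summarize_outcomes(df, by):
    """Count battles and attacker wins per group, then derive win/loss percentages."""
    out = (
        df.groupby(by)
          .agg(battles_count=("attacker_outcome_flag", "count"),
               attacker_outcome_wins=("attacker_outcome_flag", "sum"))
          .reset_index()
          .sort_values("battles_count", ascending=False)
    )
    out["attacker_outcome_wins_pct"] = (
        out["attacker_outcome_wins"] / out["battles_count"] * 100
    )
    out["attacker_outcome_loss_pct"] = 100 - out["attacker_outcome_wins_pct"]
    return out

# Toy data (illustrative only, not the real battles)
toy = pd.DataFrame({
    "battle_type": ["ambush", "ambush", "siege", "siege", "siege"],
    "attacker_outcome_flag": [1, 1, 1, 0, 1],
})
print(summarize_outcomes(toy, ["battle_type"]))
```

Passing a list such as `['attacker_king', 'battle_type']` as `by` would give the two-key views in the same shape.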
In [20]:
df_grouped = battles_df.groupby(by=['attacker_king']).agg(
    attacker_outcome_flag_count = ('attacker_outcome_flag','count'),
    attacker_outcome_wins = ('attacker_outcome_flag','sum'),
    attacker_size_mean = ('attacker_size', 'mean'),
    defender_size_mean = ('defender_size','mean')).reset_index().sort_values(by = 'attacker_outcome_flag_count', ascending = False)

df_grouped['attacker_outcome_loss'] = df_grouped['attacker_outcome_flag_count'] - df_grouped['attacker_outcome_wins']
df_grouped['attacker_outcome_wins_pct'] = (df_grouped['attacker_outcome_wins']/df_grouped['attacker_outcome_flag_count']) * 100
df_grouped['attacker_outcome_loss_pct'] = 100 - df_grouped['attacker_outcome_wins_pct']
df_grouped
Out[20]:
attacker_king attacker_outcome_flag_count attacker_outcome_wins attacker_size_mean defender_size_mean attacker_outcome_loss attacker_outcome_wins_pct attacker_outcome_loss_pct
1 Joffrey/Tommen Baratheon 14 13.0 4329.857143 2558.571429 1.0 92.857143 7.142857
3 Robb Stark 10 8.0 4121.900000 4962.500000 2.0 80.000000 20.000000
0 Balon/Euron Greyjoy 7 7.0 183.428571 0.000000 0.0 100.000000 0.000000
4 Stannis Baratheon 4 2.0 8875.000000 8862.500000 2.0 50.000000 50.000000
2 Mance Rayder 1 0.0 100000.000000 1240.000000 1.0 0.000000 100.000000
In [21]:
df_grouped[['attacker_king','attacker_outcome_wins_pct','attacker_outcome_loss_pct']].plot.bar(x='attacker_king')
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x131a02090>
In [22]:
df_grouped[['attacker_king','attacker_size_mean','defender_size_mean']].plot.bar(x='attacker_king')
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x131ab9390>
In [23]:
# Remove Mance Rayder
df_grouped[['attacker_king','attacker_size_mean','defender_size_mean']][df_grouped.attacker_king != 'Mance Rayder'].plot.bar(x='attacker_king')
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x131a92b90>

Battle year

In [24]:
df_year = battles_df.groupby(by=['year']).agg(
    battles = ('name','count'),
    major_death = ('major_death', 'sum'),
    major_capture = ('major_capture','sum')).reset_index().sort_values(by = 'year', ascending = True)
    
df_year.plot.bar(x='year')
display(df_year)
   year  battles  major_death  major_capture
0   298        7          4.0            3.0
1   299       20          8.0            6.0
2   300       11          1.0            2.0

Battle vs Region

In [25]:
df_region = battles_df.groupby(by=['region']).agg(
    battles_count = ('name','count'),
    major_death = ('major_death', 'sum'),
    major_capture = ('major_capture','sum')).reset_index().sort_values(by = 'battles_count', ascending = False)
    
df_region.plot.bar(x='region')
display(df_region)
            region  battles_count  major_death  major_capture
4   The Riverlands             17          6.0            6.0
2        The North             10          1.0            2.0
5   The Stormlands              3          1.0            0.0
6  The Westerlands              3          2.0            1.0
1   The Crownlands              2          2.0            1.0
3        The Reach              2          0.0            0.0
0  Beyond the Wall              1          1.0            1.0

Battle type

In [26]:
df_battle_type = battles_df.groupby(by=['battle_type']).agg(
    battles_count = ('name','count'),
    major_death = ('major_death', 'sum'),
    major_capture = ('major_capture','sum')).reset_index().sort_values(by = 'battles_count', ascending = False)
    
df_battle_type.plot.bar(x='battle_type')
display(df_battle_type)
      battle_type  battles_count  major_death  major_capture
1  pitched battle             14          5.0            3.0
3           siege             11          2.0            4.0
0          ambush             10          6.0            4.0
2          razing              2          0.0            0.0
In [27]:
battles_df['region'].value_counts().plot.bar()
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x131c6ba50>
In [28]:
battles_df['battle_type'].value_counts().plot.bar()
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x129472d90>
In [29]:
battles_df['attacker_1'].value_counts().plot.bar()
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x1323a2bd0>
In [30]:
battles_df['defender_1'].value_counts().plot.bar()
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x12951ead0>
In [31]:
battles_df['summer'].value_counts().plot.bar()
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x137cd39d0>
In [32]:
battles_df['attacker_size'].hist(bins=20)
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x13481bc10>
In [33]:
battles_df['defender_size'].hist(bins=20)
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x132413c90>
In [34]:
battles_df['attack_houses'].hist(bins=10)
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x131bec8d0>
In [35]:
battles_df['defender_houses'].hist(bins=10)
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x131afa4d0>
In [36]:
battles_df['attacker_commander_count'].hist(bins=10)
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x13824d190>
In [37]:
battles_df['defender_commander_count'].hist(bins=10)
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x13243f5d0>

battle_type vs attacker_outcome

In [38]:
df_battle_type = battles_df.groupby(by=['battle_type']).agg(
    battles_count = ('name','count'),
    attacker_outcome_flag_count = ('attacker_outcome_flag','count'),
    attacker_outcome_wins = ('attacker_outcome_flag','sum'),
    attacker_size_mean = ('attacker_size', 'mean'),
    defender_size_mean = ('defender_size','mean')).reset_index().sort_values(by = 'battles_count', ascending = False)

df_battle_type['attacker_outcome_loss'] = df_battle_type['attacker_outcome_flag_count'] - df_battle_type['attacker_outcome_wins']
df_battle_type['attacker_outcome_wins_pct'] = (df_battle_type['attacker_outcome_wins']/df_battle_type['attacker_outcome_flag_count']) * 100
df_battle_type['attacker_outcome_loss_pct'] = 100 - df_battle_type['attacker_outcome_wins_pct']
df_battle_type
Out[38]:
battle_type battles_count attacker_outcome_flag_count attacker_outcome_wins attacker_size_mean defender_size_mean attacker_outcome_loss attacker_outcome_wins_pct attacker_outcome_loss_pct
1 pitched battle 14 14 10.0 6910.285714 4167.857143 4.0 71.428571 28.571429
3 siege 11 11 10.0 10227.272727 1949.090909 1.0 90.909091 9.090909
0 ambush 10 10 10.0 2437.700000 3434.500000 0.0 100.000000 0.000000
2 razing 2 2 2.0 0.000000 0.000000 0.0 100.000000 0.000000
In [39]:
# factorplot was renamed to catplot in newer seaborn releases
sns.catplot(x="battle_type", y="attacker_outcome_wins_pct",
            aspect=0.8,
            kind="bar", data=df_battle_type)
Out[39]:
<seaborn.axisgrid.FacetGrid at 0x132351e90>

Summer vs battle_type vs attacker_outcome

In [40]:
df_battle_type_summer = battles_df.groupby(by=['summer','battle_type','attacker_outcome']).agg(
    battles_count = ('name','count'),
    attacker_outcome_flag_count = ('attacker_outcome_flag','count'),
    attacker_outcome_wins = ('attacker_outcome_flag','sum'),
    attacker_size_mean = ('attacker_size', 'mean'),
    defender_size_mean = ('defender_size','mean')).reset_index().sort_values(by = 'battles_count', ascending = False)

df_battle_type_summer['attacker_outcome_loss'] = df_battle_type_summer['attacker_outcome_flag_count'] - df_battle_type_summer['attacker_outcome_wins']
df_battle_type_summer['attacker_outcome_wins_pct'] = (df_battle_type_summer['attacker_outcome_wins']/df_battle_type_summer['attacker_outcome_flag_count']) * 100
df_battle_type_summer['attacker_outcome_loss_pct'] = 100 - df_battle_type_summer['attacker_outcome_wins_pct']
df_battle_type_summer
Out[40]:
summer battle_type attacker_outcome battles_count attacker_outcome_flag_count attacker_outcome_wins attacker_size_mean defender_size_mean attacker_outcome_loss attacker_outcome_wins_pct attacker_outcome_loss_pct
4 1.0 ambush win 10 10 10.0 2437.700000 3434.500000 0.0 100.0 0.0
6 1.0 pitched battle win 7 7 7.0 4320.571429 2128.571429 0.0 100.0 0.0
3 0.0 siege win 5 5 5.0 1300.000000 40.000000 0.0 100.0 0.0
7 1.0 siege win 5 5 5.0 1200.000000 4000.000000 0.0 100.0 0.0
5 1.0 pitched battle loss 4 4 0.0 15500.000000 9312.500000 4.0 0.0 100.0
0 0.0 pitched battle win 3 3 3.0 1500.000000 2066.666667 0.0 100.0 0.0
1 0.0 razing win 2 2 2.0 0.000000 0.000000 0.0 100.0 0.0
2 0.0 siege loss 1 1 0.0 100000.000000 1240.000000 1.0 0.0 100.0
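Because attacker_outcome is one of the grouping keys in the table above, every group contains only wins or only losses, so attacker_outcome_wins_pct is always 0 or 100. A sketch of the same summary grouped by summer and battle_type alone is shown below. It rebuilds the counts from the table above rather than from battles_df, so the battle with no recorded outcome is not included.

```python
import pandas as pd

# (summer, battle_type, attacker_outcome, battles_count) copied from the table above
rows = [
    (1.0, "ambush", "win", 10), (1.0, "pitched battle", "win", 7),
    (0.0, "siege", "win", 5), (1.0, "siege", "win", 5),
    (1.0, "pitched battle", "loss", 4), (0.0, "pitched battle", "win", 3),
    (0.0, "razing", "win", 2), (0.0, "siege", "loss", 1),
]
counts = pd.DataFrame(rows, columns=["summer", "battle_type", "attacker_outcome", "n"])

# Wins are the battle counts of the "win" rows; loss rows contribute zero
counts["wins"] = counts["n"].where(counts["attacker_outcome"] == "win", 0)

# Group without attacker_outcome so each group mixes wins and losses
by_season = counts.groupby(["summer", "battle_type"], as_index=False)[["n", "wins"]].sum()
by_season["attacker_outcome_wins_pct"] = by_season["wins"] / by_season["n"] * 100
print(by_season)
```

Grouped this way, summer pitched battles show 7 attacker wins out of 11, which the per-outcome table hides.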
In [41]:
# factorplot was renamed to catplot in newer seaborn releases
sns.catplot(x="battle_type", y="attacker_outcome_wins_pct",
            col="summer", aspect=1,
            kind="bar", data=df_battle_type_summer)
Out[41]:
<seaborn.axisgrid.FacetGrid at 0x132290790>

battle_type vs. attacker_king

In [42]:
df_battle_attacker_king = battles_df.groupby(by=['attacker_king','battle_type']).agg(
    battles_count = ('name','count'),
    attacker_outcome_flag_count = ('attacker_outcome_flag','count'),
    attacker_outcome_wins = ('attacker_outcome_flag','sum'),
    attacker_size_mean = ('attacker_size', 'mean'),
    defender_size_mean = ('defender_size','mean')).reset_index().sort_values(by = 'battles_count', ascending = False)

df_battle_attacker_king['attacker_outcome_loss'] = df_battle_attacker_king['attacker_outcome_flag_count'] - df_battle_attacker_king['attacker_outcome_wins']
df_battle_attacker_king['attacker_outcome_wins_pct'] = (df_battle_attacker_king['attacker_outcome_wins']/df_battle_attacker_king['attacker_outcome_flag_count']) * 100
df_battle_attacker_king['attacker_outcome_loss_pct'] = 100 - df_battle_attacker_king['attacker_outcome_wins_pct']
df_battle_attacker_king
Out[42]:
attacker_king battle_type battles_count attacker_outcome_flag_count attacker_outcome_wins attacker_size_mean defender_size_mean attacker_outcome_loss attacker_outcome_wins_pct attacker_outcome_loss_pct
5 Joffrey/Tommen Baratheon pitched battle 6 6 5.0 8333.333333 5000.000000 1.0 83.333333 16.666667
6 Joffrey/Tommen Baratheon siege 5 5 5.0 1300.000000 40.000000 0.0 100.000000 0.000000
8 Robb Stark ambush 5 5 5.0 3995.000000 5745.000000 0.0 100.000000 0.000000
4 Joffrey/Tommen Baratheon ambush 3 3 3.0 1372.666667 1873.333333 0.0 100.000000 0.000000
9 Robb Stark pitched battle 3 3 1.0 7081.333333 6966.666667 2.0 33.333333 66.666667
0 Balon/Euron Greyjoy ambush 2 2 2.0 142.000000 0.000000 0.0 100.000000 0.000000
1 Balon/Euron Greyjoy pitched battle 2 2 2.0 0.000000 0.000000 0.0 100.000000 0.000000
3 Balon/Euron Greyjoy siege 2 2 2.0 500.000000 0.000000 0.0 100.000000 0.000000
10 Robb Stark siege 2 2 2.0 0.000000 0.000000 0.0 100.000000 0.000000
11 Stannis Baratheon pitched battle 2 2 1.0 12750.000000 3725.000000 1.0 50.000000 50.000000
2 Balon/Euron Greyjoy razing 1 1 1.0 0.000000 0.000000 0.0 100.000000 0.000000
7 Mance Rayder siege 1 1 0.0 100000.000000 1240.000000 1.0 0.000000 100.000000
12 Stannis Baratheon siege 1 1 1.0 5000.000000 20000.000000 0.0 100.000000 0.000000
In [43]:
chart = sns.catplot(x="attacker_king", y="attacker_outcome_wins_pct",
                       col="battle_type", aspect=1,
                       kind="bar", data=df_battle_attacker_king)
chart.set_xticklabels(rotation=45, horizontalalignment='right')
Out[43]:
<seaborn.axisgrid.FacetGrid at 0x12f946e90>

battle_type vs. defender_king

In [44]:
df_battle_defender_king = battles_df.groupby(by=['defender_king','battle_type']).agg(
    battles_count = ('name','count'),
    attacker_outcome_flag_count = ('attacker_outcome_flag','count'),
    attacker_outcome_wins = ('attacker_outcome_flag','sum'),
    attacker_size_mean = ('attacker_size', 'mean'),
    defender_size_mean = ('defender_size','mean')).reset_index().sort_values(by = 'battles_count', ascending = False)

df_battle_defender_king['attacker_outcome_loss'] = df_battle_defender_king['attacker_outcome_flag_count'] - df_battle_defender_king['attacker_outcome_wins']
df_battle_defender_king['attacker_outcome_wins_pct'] = (df_battle_defender_king['attacker_outcome_wins']/df_battle_defender_king['attacker_outcome_flag_count']) * 100
df_battle_defender_king['attacker_outcome_loss_pct'] = 100 - df_battle_defender_king['attacker_outcome_wins_pct']
df_battle_defender_king
Out[44]:
defender_king battle_type battles_count attacker_outcome_flag_count attacker_outcome_wins attacker_size_mean defender_size_mean attacker_outcome_loss attacker_outcome_wins_pct attacker_outcome_loss_pct
8 Robb Stark pitched battle 6 6 5.0 8333.333333 5000.0 1.0 83.333333 16.666667
2 Joffrey/Tommen Baratheon ambush 5 5 5.0 3995.000000 5745.0 0.0 100.000000 0.000000
7 Robb Stark ambush 5 5 5.0 880.400000 1124.0 0.0 100.000000 0.000000
3 Joffrey/Tommen Baratheon pitched battle 4 4 1.0 10500.000000 6812.5 3.0 25.000000 75.000000
9 Robb Stark siege 3 3 3.0 1833.333333 0.0 0.0 100.000000 0.000000
10 Stannis Baratheon siege 3 3 2.0 34000.000000 480.0 1.0 66.666667 33.333333
0 Balon/Euron Greyjoy pitched battle 2 2 2.0 2372.000000 550.0 0.0 100.000000 0.000000
1 Balon/Euron Greyjoy siege 2 2 2.0 0.000000 0.0 0.0 100.000000 0.000000
5 Joffrey/Tommen Baratheon siege 2 2 2.0 0.000000 0.0 0.0 100.000000 0.000000
4 Joffrey/Tommen Baratheon razing 1 1 1.0 0.000000 0.0 0.0 100.000000 0.000000
6 Renly Baratheon siege 1 1 1.0 5000.000000 20000.0 0.0 100.000000 0.000000
In [45]:
chart = sns.catplot(x="defender_king", y="attacker_outcome_loss_pct",
                       col="battle_type", aspect=1,
                       kind="bar", data=df_battle_defender_king)
chart.set_xticklabels(rotation=45, horizontalalignment='right')
Out[45]:
<seaborn.axisgrid.FacetGrid at 0x131be9490>

attacker_king vs defender_king

In [46]:
df_attack_defend = battles_df.groupby(by=['attacker_king','defender_king']).agg(
    battles_count = ('name','count'),
    attacker_outcome_flag_count = ('attacker_outcome_flag','count'),
    attacker_outcome_wins = ('attacker_outcome_flag','sum'),
    attacker_size_mean = ('attacker_size', 'mean'),
    defender_size_mean = ('defender_size','mean')).reset_index().sort_values(by = 'battles_count', ascending = False)

df_attack_defend['attacker_outcome_loss'] = df_attack_defend['attacker_outcome_flag_count'] - df_attack_defend['attacker_outcome_wins']
df_attack_defend['attacker_outcome_wins_pct'] = (df_attack_defend['attacker_outcome_wins']/df_attack_defend['attacker_outcome_flag_count']) * 100
df_attack_defend['attacker_outcome_loss_pct'] = 100 - df_attack_defend['attacker_outcome_wins_pct']
df_attack_defend
Out[46]:
attacker_king defender_king battles_count attacker_outcome_flag_count attacker_outcome_wins attacker_size_mean defender_size_mean attacker_outcome_loss attacker_outcome_wins_pct attacker_outcome_loss_pct
4 Joffrey/Tommen Baratheon Robb Stark 10 10 9.0 5861.800000 3562.000000 1.0 90.000000 10.000000
8 Robb Stark Joffrey/Tommen Baratheon 9 9 7.0 4552.777778 5413.888889 2.0 77.777778 22.222222
2 Balon/Euron Greyjoy Robb Stark 4 4 4.0 321.000000 0.000000 0.0 100.000000 0.000000
1 Balon/Euron Greyjoy Joffrey/Tommen Baratheon 2 2 2.0 0.000000 0.000000 0.0 100.000000 0.000000
5 Joffrey/Tommen Baratheon Stannis Baratheon 2 2 2.0 1000.000000 100.000000 0.0 100.000000 0.000000
10 Stannis Baratheon Joffrey/Tommen Baratheon 2 2 0.0 13000.000000 7625.000000 2.0 0.000000 100.000000
0 Balon/Euron Greyjoy Balon/Euron Greyjoy 1 1 1.0 0.000000 0.000000 0.0 100.000000 0.000000
3 Joffrey/Tommen Baratheon Balon/Euron Greyjoy 1 1 1.0 0.000000 0.000000 0.0 100.000000 0.000000
6 Mance Rayder Stannis Baratheon 1 1 0.0 100000.000000 1240.000000 1.0 0.000000 100.000000
7 Robb Stark Balon/Euron Greyjoy 1 1 1.0 244.000000 900.000000 0.0 100.000000 0.000000
9 Stannis Baratheon Balon/Euron Greyjoy 1 1 1.0 4500.000000 200.000000 0.0 100.000000 0.000000
11 Stannis Baratheon Renly Baratheon 1 1 1.0 5000.000000 20000.000000 0.0 100.000000 0.000000
In [47]:
chart = sns.catplot(x="attacker_king", y="battles_count",
                       col="defender_king", aspect=1,
                       kind="bar", data=df_attack_defend)
chart.set_xticklabels(rotation=45, horizontalalignment='right')
Out[47]:
<seaborn.axisgrid.FacetGrid at 0x132f87b10>
In [48]:
chart = sns.catplot(x="attacker_king", y="attacker_outcome_loss_pct",
                       col="defender_king", aspect=1,
                       kind="bar", data=df_attack_defend)
chart.set_xticklabels(rotation=45, horizontalalignment='right')
Out[48]:
<seaborn.axisgrid.FacetGrid at 0x13337cbd0>
In [49]:
chart = sns.catplot(x="attacker_king", y="attacker_outcome_wins_pct",
                       col="defender_king", aspect=1,
                       kind="bar", data=df_attack_defend)
chart.set_xticklabels(rotation=45, horizontalalignment='right')
Out[49]:
<seaborn.axisgrid.FacetGrid at 0x1330a0310>
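The groupby cells above repeat the same aggregation with different grouping keys. One way to factor this out is a small helper. This is a sketch rather than part of the original notebook: the function name `summarize_outcomes` and the toy DataFrame are hypothetical, and it assumes `attacker_outcome_flag` is 1 for an attacker win and 0 for a loss, as the tables above suggest.

```python
import pandas as pd

def summarize_outcomes(df, by):
    """Aggregate battle counts, win/loss counts and mean army sizes per group."""
    out = (
        df.groupby(by=by)
          .agg(
              battles_count=('name', 'count'),
              attacker_outcome_flag_count=('attacker_outcome_flag', 'count'),
              attacker_outcome_wins=('attacker_outcome_flag', 'sum'),
              attacker_size_mean=('attacker_size', 'mean'),
              defender_size_mean=('defender_size', 'mean'),
          )
          .reset_index()
          .sort_values(by='battles_count', ascending=False)
    )
    # Derived loss count and win/loss percentages, as in the cells above
    out['attacker_outcome_loss'] = out['attacker_outcome_flag_count'] - out['attacker_outcome_wins']
    out['attacker_outcome_wins_pct'] = out['attacker_outcome_wins'] / out['attacker_outcome_flag_count'] * 100
    out['attacker_outcome_loss_pct'] = 100 - out['attacker_outcome_wins_pct']
    return out

# Toy data (hypothetical, not the real battles.csv)
toy_df = pd.DataFrame({
    'name': ['b1', 'b2', 'b3', 'b4'],
    'attacker_king': ['A', 'A', 'B', 'A'],
    'defender_king': ['B', 'B', 'A', 'B'],
    'attacker_outcome_flag': [1, 0, 1, 1],
    'attacker_size': [100.0, 200.0, 300.0, 600.0],
    'defender_size': [50.0, 400.0, 100.0, 150.0],
})
summary = summarize_outcomes(toy_df, by=['attacker_king', 'defender_king'])
print(summary)
```

With such a helper, each section would reduce to one call, for example `summarize_outcomes(battles_df, by=['attacker_king', 'battle_type'])`.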

Battle Size

In [50]:
chart = sns.boxplot(x="attacker_king", y="attacker_size", data=battles_df[battles_df.attacker_king != 'Mance Rayder'], palette="Set1")
chart.set_xticklabels(chart.get_xticklabels(),rotation=30)
Out[50]:
[Text(0, 0, 'Balon/Euron Greyjoy'),
 Text(0, 0, 'Robb Stark'),
 Text(0, 0, 'Joffrey/Tommen Baratheon'),
 Text(0, 0, 'Stannis Baratheon')]
In [51]:
chart = sns.boxplot(x="attacker_king", y="attacker_size", data=battles_df[(battles_df.attacker_king != 'Mance Rayder') & (battles_df.attacker_outcome == 'win')], palette="Set1")
chart.set_xticklabels(chart.get_xticklabels(),rotation=30)
Out[51]:
[Text(0, 0, 'Balon/Euron Greyjoy'),
 Text(0, 0, 'Robb Stark'),
 Text(0, 0, 'Joffrey/Tommen Baratheon'),
 Text(0, 0, 'Stannis Baratheon')]
In [52]:
chart = sns.boxplot(x="attacker_king", y="attacker_size", data=battles_df[(battles_df.attacker_king != 'Mance Rayder') & (battles_df.attacker_outcome == 'loss')], palette="Set1")
chart.set_xticklabels(chart.get_xticklabels(),rotation=30)
Out[52]:
[Text(0, 0, 'Robb Stark'),
 Text(0, 0, 'Joffrey/Tommen Baratheon'),
 Text(0, 0, 'Stannis Baratheon')]
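The boxplots above summarise quartiles per king. The same information can be read as numbers with a grouped `describe()`. The sketch below uses a hypothetical toy DataFrame with made-up kings `A` and `B`, not the real data.

```python
import pandas as pd

# Toy data (hypothetical): two kings with three battles each
toy_df = pd.DataFrame({
    'attacker_king': ['A', 'A', 'A', 'B', 'B', 'B'],
    'attacker_size': [100.0, 200.0, 900.0, 1000.0, 3000.0, 5000.0],
})

# count, mean, std, min, 25%, 50%, 75%, max per king
size_stats = toy_df.groupby('attacker_king')['attacker_size'].describe()
print(size_stats[['25%', '50%', '75%']])
```

On the notebook's data, the equivalent call would be `battles_df.groupby('attacker_king')['attacker_size'].describe()`.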

Start Modeling Here

I will build two kinds of machine learning models:

1) Regression Model - attacker_size vs. defender_size, to estimate the army size a house will be fighting against

2) Classification Model - What factors determine the attacker_outcome?

Regression Model

In [53]:
sns.regplot(x='attacker_size',y='defender_size',data=battles_df)
display(battles_df[['attacker_size','defender_size']].corr())
attacker_size defender_size
attacker_size 1.00000 0.17306
defender_size 0.17306 1.00000
In [54]:
battles_df1 = battles_df[(battles_df.attacker_king != 'Mance Rayder') & (battles_df.attacker_size > 0) & (battles_df.defender_size > 0)]

sns.lmplot(x='attacker_size', y='defender_size', data=battles_df1[['attacker_size','defender_size']],
           robust=True, ci=None, scatter_kws={"s": 95})
display(battles_df1[['attacker_size','defender_size']].corr())
attacker_size defender_size
attacker_size 1.000000 0.438731
defender_size 0.438731 1.000000
In [55]:
battles_df1 = battles_df[(battles_df.attacker_king != 'Mance Rayder') & (battles_df.attacker_size > 0) & (battles_df.defender_size > 0)& (battles_df.attacker_size < 14000) & (battles_df.defender_size < 20000)]
sns.lmplot(x='attacker_size', y='defender_size', data=battles_df1[['attacker_size','defender_size']],
           robust=True, ci=None, scatter_kws={"s": 80})
display(battles_df1[['attacker_size','defender_size']].corr())
attacker_size defender_size
attacker_size 1.000000 0.758063
defender_size 0.758063 1.000000
In [56]:
battles_df1 = battles_df[(battles_df.attacker_king != 'Mance Rayder') & (battles_df.attacker_outcome_flag == 1) & (battles_df.attacker_size > 0) & (battles_df.defender_size > 0) & (battles_df.attacker_size < 14000) & (battles_df.defender_size < 20000)]
sns.lmplot(x='attacker_size', y='defender_size', data=battles_df1[['attacker_size','defender_size']],
           robust=True, ci=None, scatter_kws={"s": 80})
display(battles_df1[['attacker_size','defender_size']].corr())
attacker_size defender_size
attacker_size 1.000000 0.738457
defender_size 0.738457 1.000000
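The correlation rises from 0.17 to 0.44 to 0.76 as zero-size rows, the Mance Rayder battle and the largest armies are filtered out. This reflects how sensitive the Pearson coefficient is to a few extreme points. The sketch below illustrates the effect with hypothetical army sizes chosen for illustration. Only the appended pair (100000, 1240) mirrors the Mance Rayder row from the tables above.

```python
import numpy as np

# Toy army sizes (hypothetical values chosen for illustration)
attacker = np.array([100, 1000, 3000, 6000, 15000], dtype=float)
defender = np.array([100, 900, 3500, 10000, 10000], dtype=float)

# Pearson correlation on the "clean" battles
r_clean = np.corrcoef(attacker, defender)[0, 1]

# Append one battle with a huge attacker and a small defender
attacker_out = np.append(attacker, 100000.0)
defender_out = np.append(defender, 1240.0)
r_with_outlier = np.corrcoef(attacker_out, defender_out)[0, 1]

print(round(r_clean, 3), round(r_with_outlier, 3))
```

In this toy case a single extreme battle is enough to turn a strong positive correlation into a negative one, which is why the filtering choices above change the result so much.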

Classification Modeling

In [57]:
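# Drop identifier/free-text columns, battle_size, and the outcome columns so the target does not leak into the features
model_df = battles_df.drop(columns = ['name','location','attacker_commander','defender_commander','attacker_outcome','attacker_outcome_flag','battle_size'])
# Target: attacker_outcome_flag (1 = attacker win)
attacker_outcome = battles_df['attacker_outcome_flag']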
In [58]:
model_df.dtypes
Out[58]:
year                          int64
attacker_king                object
defender_king                object
attacker_1                   object
defender_1                   object
battle_type                  object
major_death                 float64
major_capture               float64
attacker_size               float64
defender_size               float64
summer                      float64
region                       object
attack_houses                 int64
defender_houses               int64
attacker_commander_count      int64
defender_commander_count      int64
dtype: object
In [59]:
model_df[model_df.columns[0:10]]
Out[59]:
year attacker_king defender_king attacker_1 defender_1 battle_type major_death major_capture attacker_size defender_size
0 299 Balon/Euron Greyjoy Robb Stark Greyjoy Stark ambush 0.0 1.0 20.0 0.0
1 299 Robb Stark Joffrey/Tommen Baratheon Stark Lannister ambush 1.0 0.0 100.0 100.0
2 299 Robb Stark Balon/Euron Greyjoy Stark Greyjoy pitched battle 0.0 0.0 244.0 900.0
3 299 Balon/Euron Greyjoy Robb Stark Greyjoy Stark ambush 0.0 0.0 264.0 0.0
4 299 Joffrey/Tommen Baratheon Robb Stark Bolton Stark ambush 1.0 0.0 618.0 2000.0
5 299 Balon/Euron Greyjoy Robb Stark Greyjoy Stark siege 0.0 0.0 1000.0 0.0
6 300 Joffrey/Tommen Baratheon Robb Stark Bracken Blackwood siege 0.0 1.0 1500.0 0.0
7 298 Robb Stark Joffrey/Tommen Baratheon Stark Lannister ambush 1.0 1.0 1875.0 6000.0
8 300 Joffrey/Tommen Baratheon Stannis Baratheon Baratheon Baratheon siege 0.0 0.0 2000.0 0.0
9 300 Joffrey/Tommen Baratheon Robb Stark Lannister Tully siege 0.0 0.0 3000.0 0.0
10 299 Robb Stark Joffrey/Tommen Baratheon Stark Lannister pitched battle 1.0 0.0 3000.0 0.0
11 299 Joffrey/Tommen Baratheon Robb Stark Frey Stark ambush 1.0 1.0 3500.0 3500.0
12 300 Stannis Baratheon Balon/Euron Greyjoy Baratheon Greyjoy pitched battle 0.0 0.0 4500.0 200.0
13 299 Stannis Baratheon Renly Baratheon Baratheon Baratheon siege 1.0 0.0 5000.0 20000.0
14 300 Stannis Baratheon Joffrey/Tommen Baratheon Baratheon Bolton NaN 0.0 0.0 5000.0 8000.0
15 298 Robb Stark Joffrey/Tommen Baratheon Stark Lannister ambush 0.0 0.0 6000.0 12625.0
16 299 Robb Stark Joffrey/Tommen Baratheon Stark Lannister ambush 0.0 0.0 6000.0 0.0
17 299 Robb Stark Joffrey/Tommen Baratheon Stark Lannister ambush 1.0 1.0 6000.0 10000.0
18 298 Joffrey/Tommen Baratheon Robb Stark Lannister Tully pitched battle 1.0 0.0 15000.0 4000.0
19 298 Joffrey/Tommen Baratheon Robb Stark Lannister Tully pitched battle 0.0 1.0 15000.0 10000.0
20 298 Robb Stark Joffrey/Tommen Baratheon Stark Lannister pitched battle 1.0 1.0 18000.0 20000.0
21 299 Joffrey/Tommen Baratheon Robb Stark Lannister Tully pitched battle 0.0 0.0 20000.0 10000.0
22 299 Stannis Baratheon Joffrey/Tommen Baratheon Baratheon Lannister pitched battle 1.0 1.0 21000.0 7250.0
23 300 Mance Rayder Stannis Baratheon Free folk Night's Watch siege 1.0 1.0 100000.0 1240.0
24 298 Joffrey/Tommen Baratheon Robb Stark Lannister Baratheon ambush 1.0 0.0 0.0 120.0
25 300 Joffrey/Tommen Baratheon Stannis Baratheon Baratheon Baratheon siege 0.0 0.0 0.0 200.0
26 299 Robb Stark Joffrey/Tommen Baratheon Frey Mallister siege 0.0 1.0 0.0 0.0
27 298 Joffrey/Tommen Baratheon Robb Stark Lannister Darry pitched battle 0.0 0.0 0.0 0.0
28 299 Joffrey/Tommen Baratheon Robb Stark Lannister Stark pitched battle 0.0 0.0 0.0 6000.0
29 299 Joffrey/Tommen Baratheon NaN Lannister Brave Companions pitched battle 1.0 0.0 0.0 0.0
30 299 Balon/Euron Greyjoy Balon/Euron Greyjoy Greyjoy Stark siege 0.0 1.0 0.0 0.0
31 300 Balon/Euron Greyjoy Joffrey/Tommen Baratheon Greyjoy Tyrell pitched battle 0.0 0.0 0.0 0.0
32 300 Balon/Euron Greyjoy Joffrey/Tommen Baratheon Greyjoy Tyrell razing 0.0 0.0 0.0 0.0
33 299 Robb Stark Joffrey/Tommen Baratheon Darry Lannister siege 0.0 0.0 0.0 0.0
34 300 Joffrey/Tommen Baratheon Balon/Euron Greyjoy Bolton Greyjoy siege 0.0 0.0 0.0 0.0
35 300 NaN NaN Brave Companions NaN razing 0.0 0.0 0.0 0.0
36 299 Balon/Euron Greyjoy Robb Stark Greyjoy Stark pitched battle 0.0 0.0 0.0 0.0
37 299 NaN NaN Brotherhood without Banners Brave Companions pitched battle 0.0 0.0 0.0 0.0
In [60]:
model_df[model_df.columns[11:23]]
Out[60]:
region attack_houses defender_houses attacker_commander_count defender_commander_count
0 The North 1 1 1 1
1 The Riverlands 1 1 3 1
2 The North 1 1 2 1
3 The North 1 1 1 0
4 The North 2 1 2 3
5 The North 1 1 1 0
6 The Riverlands 2 1 2 1
7 The Riverlands 2 1 2 1
8 The Stormlands 1 1 2 1
9 The Riverlands 2 1 3 1
10 The Crownlands 1 1 2 2
11 The Riverlands 2 1 3 1
12 The North 4 1 2 1
13 The Stormlands 1 1 2 5
14 The North 4 2 1 1
15 The Riverlands 2 1 3 2
16 The Westerlands 1 1 3 1
17 The Westerlands 2 1 2 3
18 The Westerlands 1 1 1 2
19 The Riverlands 1 1 2 2
20 The Riverlands 1 1 5 4
21 The Riverlands 1 1 6 3
22 The Crownlands 1 1 6 7
23 Beyond the Wall 3 2 5 4
24 The Riverlands 1 1 1 1
25 The Stormlands 1 1 2 1
26 The Riverlands 1 1 1 1
27 The Riverlands 1 1 1 1
28 The Riverlands 1 1 1 2
29 The Riverlands 1 1 1 1
30 The North 1 1 1 0
31 The Reach 1 1 2 0
32 The Reach 1 1 2 0
33 The Riverlands 1 1 1 0
34 The North 1 1 1 0
35 The Riverlands 1 0 1 0
36 The North 1 1 1 0
37 The Riverlands 1 1 0 0
In [61]:
model_df[['attacker_king','defender_king','attacker_1','defender_1']] = model_df[['attacker_king','defender_king','attacker_1','defender_1']].replace('/','_',regex=True)
model_df[['attacker_king','defender_king','region','battle_type','attacker_1','defender_1']] = model_df[['attacker_king','defender_king','region','battle_type','attacker_1','defender_1']].replace(' ','_',regex=True)
model_df[model_df.columns[0:10]]
Out[61]:
year attacker_king defender_king attacker_1 defender_1 battle_type major_death major_capture attacker_size defender_size
0 299 Balon_Euron_Greyjoy Robb_Stark Greyjoy Stark ambush 0.0 1.0 20.0 0.0
1 299 Robb_Stark Joffrey_Tommen_Baratheon Stark Lannister ambush 1.0 0.0 100.0 100.0
2 299 Robb_Stark Balon_Euron_Greyjoy Stark Greyjoy pitched_battle 0.0 0.0 244.0 900.0
3 299 Balon_Euron_Greyjoy Robb_Stark Greyjoy Stark ambush 0.0 0.0 264.0 0.0
4 299 Joffrey_Tommen_Baratheon Robb_Stark Bolton Stark ambush 1.0 0.0 618.0 2000.0
5 299 Balon_Euron_Greyjoy Robb_Stark Greyjoy Stark siege 0.0 0.0 1000.0 0.0
6 300 Joffrey_Tommen_Baratheon Robb_Stark Bracken Blackwood siege 0.0 1.0 1500.0 0.0
7 298 Robb_Stark Joffrey_Tommen_Baratheon Stark Lannister ambush 1.0 1.0 1875.0 6000.0
8 300 Joffrey_Tommen_Baratheon Stannis_Baratheon Baratheon Baratheon siege 0.0 0.0 2000.0 0.0
9 300 Joffrey_Tommen_Baratheon Robb_Stark Lannister Tully siege 0.0 0.0 3000.0 0.0
10 299 Robb_Stark Joffrey_Tommen_Baratheon Stark Lannister pitched_battle 1.0 0.0 3000.0 0.0
11 299 Joffrey_Tommen_Baratheon Robb_Stark Frey Stark ambush 1.0 1.0 3500.0 3500.0
12 300 Stannis_Baratheon Balon_Euron_Greyjoy Baratheon Greyjoy pitched_battle 0.0 0.0 4500.0 200.0
13 299 Stannis_Baratheon Renly_Baratheon Baratheon Baratheon siege 1.0 0.0 5000.0 20000.0
14 300 Stannis_Baratheon Joffrey_Tommen_Baratheon Baratheon Bolton NaN 0.0 0.0 5000.0 8000.0
15 298 Robb_Stark Joffrey_Tommen_Baratheon Stark Lannister ambush 0.0 0.0 6000.0 12625.0
16 299 Robb_Stark Joffrey_Tommen_Baratheon Stark Lannister ambush 0.0 0.0 6000.0 0.0
17 299 Robb_Stark Joffrey_Tommen_Baratheon Stark Lannister ambush 1.0 1.0 6000.0 10000.0
18 298 Joffrey_Tommen_Baratheon Robb_Stark Lannister Tully pitched_battle 1.0 0.0 15000.0 4000.0
19 298 Joffrey_Tommen_Baratheon Robb_Stark Lannister Tully pitched_battle 0.0 1.0 15000.0 10000.0
20 298 Robb_Stark Joffrey_Tommen_Baratheon Stark Lannister pitched_battle 1.0 1.0 18000.0 20000.0
21 299 Joffrey_Tommen_Baratheon Robb_Stark Lannister Tully pitched_battle 0.0 0.0 20000.0 10000.0
22 299 Stannis_Baratheon Joffrey_Tommen_Baratheon Baratheon Lannister pitched_battle 1.0 1.0 21000.0 7250.0
23 300 Mance_Rayder Stannis_Baratheon Free_folk Night's_Watch siege 1.0 1.0 100000.0 1240.0
24 298 Joffrey_Tommen_Baratheon Robb_Stark Lannister Baratheon ambush 1.0 0.0 0.0 120.0
25 300 Joffrey_Tommen_Baratheon Stannis_Baratheon Baratheon Baratheon siege 0.0 0.0 0.0 200.0
26 299 Robb_Stark Joffrey_Tommen_Baratheon Frey Mallister siege 0.0 1.0 0.0 0.0
27 298 Joffrey_Tommen_Baratheon Robb_Stark Lannister Darry pitched_battle 0.0 0.0 0.0 0.0
28 299 Joffrey_Tommen_Baratheon Robb_Stark Lannister Stark pitched_battle 0.0 0.0 0.0 6000.0
29 299 Joffrey_Tommen_Baratheon NaN Lannister Brave_Companions pitched_battle 1.0 0.0 0.0 0.0
30 299 Balon_Euron_Greyjoy Balon_Euron_Greyjoy Greyjoy Stark siege 0.0 1.0 0.0 0.0
31 300 Balon_Euron_Greyjoy Joffrey_Tommen_Baratheon Greyjoy Tyrell pitched_battle 0.0 0.0 0.0 0.0
32 300 Balon_Euron_Greyjoy Joffrey_Tommen_Baratheon Greyjoy Tyrell razing 0.0 0.0 0.0 0.0
33 299 Robb_Stark Joffrey_Tommen_Baratheon Darry Lannister siege 0.0 0.0 0.0 0.0
34 300 Joffrey_Tommen_Baratheon Balon_Euron_Greyjoy Bolton Greyjoy siege 0.0 0.0 0.0 0.0
35 300 NaN NaN Brave_Companions NaN razing 0.0 0.0 0.0 0.0
36 299 Balon_Euron_Greyjoy Robb_Stark Greyjoy Stark pitched_battle 0.0 0.0 0.0 0.0
37 299 NaN NaN Brotherhood_without_Banners Brave_Companions pitched_battle 0.0 0.0 0.0 0.0
In [62]:
model_df[model_df.columns[10:24]]
Out[62]:
summer region attack_houses defender_houses attacker_commander_count defender_commander_count
0 1.0 The_North 1 1 1 1
1 1.0 The_Riverlands 1 1 3 1
2 1.0 The_North 1 1 2 1
3 1.0 The_North 1 1 1 0
4 1.0 The_North 2 1 2 3
5 1.0 The_North 1 1 1 0
6 0.0 The_Riverlands 2 1 2 1
7 1.0 The_Riverlands 2 1 2 1
8 0.0 The_Stormlands 1 1 2 1
9 0.0 The_Riverlands 2 1 3 1
10 1.0 The_Crownlands 1 1 2 2
11 1.0 The_Riverlands 2 1 3 1
12 0.0 The_North 4 1 2 1
13 1.0 The_Stormlands 1 1 2 5
14 0.0 The_North 4 2 1 1
15 1.0 The_Riverlands 2 1 3 2
16 1.0 The_Westerlands 1 1 3 1
17 1.0 The_Westerlands 2 1 2 3
18 1.0 The_Westerlands 1 1 1 2
19 1.0 The_Riverlands 1 1 2 2
20 1.0 The_Riverlands 1 1 5 4
21 1.0 The_Riverlands 1 1 6 3
22 1.0 The_Crownlands 1 1 6 7
23 0.0 Beyond_the_Wall 3 2 5 4
24 1.0 The_Riverlands 1 1 1 1
25 0.0 The_Stormlands 1 1 2 1
26 1.0 The_Riverlands 1 1 1 1
27 1.0 The_Riverlands 1 1 1 1
28 0.0 The_Riverlands 1 1 1 2
29 1.0 The_Riverlands 1 1 1 1
30 1.0 The_North 1 1 1 0
31 0.0 The_Reach 1 1 2 0
32 0.0 The_Reach 1 1 2 0
33 1.0 The_Riverlands 1 1 1 0
34 0.0 The_North 1 1 1 0
35 0.0 The_Riverlands 1 0 1 0
36 1.0 The_North 1 1 1 0
37 1.0 The_Riverlands 1 1 0 0
In [63]:
categorical_feature_mask = model_df.dtypes==object
categorical_cols = model_df.columns[categorical_feature_mask].tolist()
categorical_cols
Out[63]:
['attacker_king',
 'defender_king',
 'attacker_1',
 'defender_1',
 'battle_type',
 'region']
In [64]:
model_df1 = pd.get_dummies(model_df, columns=categorical_cols, prefix = categorical_cols)
model_df1.head()
Out[64]:
year major_death major_capture attacker_size defender_size summer attack_houses defender_houses attacker_commander_count defender_commander_count ... battle_type_pitched_battle battle_type_razing battle_type_siege region_Beyond_the_Wall region_The_Crownlands region_The_North region_The_Reach region_The_Riverlands region_The_Stormlands region_The_Westerlands
0 299 0.0 1.0 20.0 0.0 1.0 1 1 1 1 ... 0 0 0 0 0 1 0 0 0 0
1 299 1.0 0.0 100.0 100.0 1.0 1 1 3 1 ... 0 0 0 0 0 0 0 1 0 0
2 299 0.0 0.0 244.0 900.0 1.0 1 1 2 1 ... 1 0 0 0 0 1 0 0 0 0
3 299 0.0 0.0 264.0 0.0 1.0 1 1 1 0 ... 0 0 0 0 0 1 0 0 0 0
4 299 1.0 0.0 618.0 2000.0 1.0 2 1 2 3 ... 0 0 0 0 0 1 0 0 0 0

5 rows × 54 columns

In [65]:
model_df1.columns
Out[65]:
Index(['year', 'major_death', 'major_capture', 'attacker_size',
       'defender_size', 'summer', 'attack_houses', 'defender_houses',
       'attacker_commander_count', 'defender_commander_count',
       'attacker_king_Balon_Euron_Greyjoy',
       'attacker_king_Joffrey_Tommen_Baratheon', 'attacker_king_Mance_Rayder',
       'attacker_king_Robb_Stark', 'attacker_king_Stannis_Baratheon',
       'defender_king_Balon_Euron_Greyjoy',
       'defender_king_Joffrey_Tommen_Baratheon',
       'defender_king_Renly_Baratheon', 'defender_king_Robb_Stark',
       'defender_king_Stannis_Baratheon', 'attacker_1_Baratheon',
       'attacker_1_Bolton', 'attacker_1_Bracken',
       'attacker_1_Brave_Companions', 'attacker_1_Brotherhood_without_Banners',
       'attacker_1_Darry', 'attacker_1_Free_folk', 'attacker_1_Frey',
       'attacker_1_Greyjoy', 'attacker_1_Lannister', 'attacker_1_Stark',
       'defender_1_Baratheon', 'defender_1_Blackwood', 'defender_1_Bolton',
       'defender_1_Brave_Companions', 'defender_1_Darry', 'defender_1_Greyjoy',
       'defender_1_Lannister', 'defender_1_Mallister',
       'defender_1_Night's_Watch', 'defender_1_Stark', 'defender_1_Tully',
       'defender_1_Tyrell', 'battle_type_ambush', 'battle_type_pitched_battle',
       'battle_type_razing', 'battle_type_siege', 'region_Beyond_the_Wall',
       'region_The_Crownlands', 'region_The_North', 'region_The_Reach',
       'region_The_Riverlands', 'region_The_Stormlands',
       'region_The_Westerlands'],
      dtype='object')
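The column list above contains no indicator for missing kings, even though rows 35 and 37 have NaN in attacker_king: by default `pd.get_dummies` encodes a missing value as all zeros across that feature's columns. The sketch below shows this on a hypothetical three-row frame, together with the `dummy_na=True` option that adds an explicit indicator column.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): two known kings and one missing value
toy_df = pd.DataFrame({'attacker_king': ['Robb_Stark', 'Mance_Rayder', np.nan]})

# Default behaviour: the NaN row gets zeros in every attacker_king_* column
encoded = pd.get_dummies(toy_df, columns=['attacker_king'], prefix=['attacker_king'])

# dummy_na=True adds an extra attacker_king_nan indicator column
encoded_na = pd.get_dummies(toy_df, columns=['attacker_king'],
                            prefix=['attacker_king'], dummy_na=True)

print(encoded)
print(encoded_na)
```

Whether the explicit indicator helps depends on whether "king unknown" carries signal. The notebook keeps the default.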

Log transform defender_size and attacker_size

In [66]:
# Log-transform the skewed features; log1p(x) = log(1 + x) keeps zero sizes at zero
skewed = ['attacker_size', 'defender_size']
features_log_transformed = model_df1.copy()
features_log_transformed[skewed] = model_df1[skewed].apply(np.log1p)
In [67]:
features_log_transformed['attacker_size'].hist(bins=20)
Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0x1363bae90>
In [68]:
features_log_transformed['defender_size'].hist(bins=20)
Out[68]:
<matplotlib.axes._subplots.AxesSubplot at 0x136562510>

Normalizing Numerical Features

In [69]:
# Numerical features to scale: attack_houses, defender_houses, attacker_commander_count, defender_commander_count, attacker_size, defender_size
# Import sklearn.preprocessing.MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Initialize a scaler, then apply it to the features
scaler = MinMaxScaler() # default=(0, 1)
numerical = ['attack_houses','defender_houses','attacker_commander_count','defender_commander_count','attacker_size','defender_size']

features_log_minmax_transform = pd.DataFrame(data = features_log_transformed)
features_log_minmax_transform[numerical] = scaler.fit_transform(features_log_transformed[numerical])

# Show an example of a record with scaling applied
display(features_log_minmax_transform.head(n = 5))
year major_death major_capture attacker_size defender_size summer attack_houses defender_houses attacker_commander_count defender_commander_count ... battle_type_pitched_battle battle_type_razing battle_type_siege region_Beyond_the_Wall region_The_Crownlands region_The_North region_The_Reach region_The_Riverlands region_The_Stormlands region_The_Westerlands
0 299 0.0 1.0 0.264444 0.000000 1.0 0.000000 0.5 0.166667 0.142857 ... 0 0 0 0 0 1 0 0 0 0
1 299 1.0 0.0 0.400864 0.466007 1.0 0.000000 0.5 0.500000 0.142857 ... 0 0 0 0 0 0 0 1 0 0
2 299 0.0 0.0 0.477833 0.686977 1.0 0.000000 0.5 0.333333 0.142857 ... 1 0 0 0 0 1 0 0 0 0
3 299 0.0 0.0 0.484649 0.000000 1.0 0.000000 0.5 0.166667 0.000000 ... 0 0 0 0 0 1 0 0 0 0
4 299 1.0 0.0 0.558338 0.767544 1.0 0.333333 0.5 0.333333 0.428571 ... 0 0 0 0 0 1 0 0 0 0

5 rows × 54 columns
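The scaled values in the table above can be checked by hand. Min-max scaling maps each value to (x - min) / (max - min), applied here after the log transform for the two size columns. The sketch below reproduces two entries of row 0. The helper name `min_max` is hypothetical, and the column ranges (attacker_size from 0 to 100000, defender_commander_count from 0 to 7) are read off the tables shown earlier.

```python
import numpy as np

def min_max(value, col_min, col_max):
    """Scale value to [0, 1] given the column minimum and maximum."""
    return (value - col_min) / (col_max - col_min)

# attacker_size for row 0 is 20; the column ranges from 0 to 100000 before the log transform
log_min, log_max = np.log1p(0.0), np.log1p(100000.0)
attacker_size_scaled = min_max(np.log1p(20.0), log_min, log_max)

# defender_commander_count for row 0 is 1; the column ranges from 0 to 7
commander_scaled = min_max(1, 0, 7)

print(round(attacker_size_scaled, 6), round(commander_scaled, 6))
```

Both results agree with the first row of the scaled table (0.264444 and 0.142857).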

In [70]:
features_final = features_log_minmax_transform
features_final.to_csv('../../data/battles_data_model.csv',index = False)

Shuffle and split data

In [71]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features_final, 
                                                    attacker_outcome, 
                                                    random_state = 0,
                                                    test_size = 0.25)

# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))
Training set has 28 samples.
Testing set has 10 samples.
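With so few samples and mostly attacker wins, it helps to know what accuracy a trivial "always predict win" rule would reach before comparing models. The sketch below takes the win and loss counts from the summer / battle_type table earlier in this notebook. That table covers the 37 battles with a known battle type, so the figure is an approximation for the full dataset.

```python
# Win and loss counts per group, read off the summer / battle_type table
wins_by_group = [10, 7, 5, 5, 3, 2]
losses_by_group = [4, 1]

wins = sum(wins_by_group)
losses = sum(losses_by_group)

# Accuracy of a classifier that always predicts an attacker win
majority_baseline = wins / (wins + losses)
print(wins, losses, round(majority_baseline, 4))
```

Any model accuracy reported below is best read against this baseline rather than against 50%.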

Model Selection

  • Logistic Regression

  • Random Forest

  • XGBoost

Logistic Regression

In [72]:
# Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 20)
display(accuracies.mean())
display(accuracies.std())
0.9166666666666666
0.2327373340628157

Logistic regression reaches a mean cross-validated accuracy of 91.7% with a standard deviation of 0.233, which gives us a good baseline classification model.
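The cell above stores the confusion matrix in `cm` but never displays it. The sketch below shows how such a matrix is laid out and how test accuracy follows from it. It uses hypothetical labels for a 10-sample test set, not the notebook's actual predictions, and builds the matrix with NumPy in the same layout as `sklearn.metrics.confusion_matrix` (rows are true classes, columns are predicted classes).

```python
import numpy as np

# Hypothetical true and predicted labels for a 10-sample test set (1 = attacker win)
y_true = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])
y_pred = np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 1])

# cm[i, j] counts samples whose true class is i and predicted class is j
cm = np.zeros((2, 2), dtype=int)
np.add.at(cm, (y_true, y_pred), 1)

# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()
print(cm)
print(accuracy)
```

In the notebook itself, adding `display(cm)` after the `confusion_matrix` call would show the held-out test result alongside the cross-validation scores.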

Random Forest

In [73]:
# Fitting Random Forest Classification to the Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 20, criterion = 'entropy',random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 20)
display(accuracies.mean())
display(accuracies.std())
0.95
0.11902380714238082
In [74]:
# Extract the feature importances from the fitted random forest
importances = classifier.feature_importances_

# Plot
vs.feature_plot(importances, X_train, y_train)

The accuracy using random forest is 95.0% with a standard deviation of 0.119. Accuracy improved by approximately 3.3 percentage points (about 3.6% in relative terms) and the standard deviation decreased by almost half.

XGBoost

In [75]:
# Fitting XGBoost to the Training set
from xgboost import XGBClassifier
classifier = XGBClassifier(random_state=0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 20)
display(accuracies.mean())
display(accuracies.std())
0.9
0.23804761428476168

Surprisingly, XGBoost has the lowest cross-validated accuracy of the three models (90%), below both Logistic Regression and Random Forest, but it is still a good model.

Random Forest has the best accuracy at 95%. Its most important features are attacker_size, attacker_commander_count, attack_houses, defender_size, and defender_houses.
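With cv = 20 on 28 training samples, every fold holds only a handful of battles, so each fold accuracy can take only a few values. The sketch below builds one set of fold scores that would reproduce the reported means and standard deviations. It is a reconstruction for illustration, not the notebook's actual per-fold output, and the helper `fold_scores` is hypothetical.

```python
import numpy as np

def fold_scores(n_perfect, n_two_thirds, n_zero):
    """Build 20 hypothetical fold accuracies from counts of 1.0, 2/3 and 0.0 scores."""
    scores = [1.0] * n_perfect + [2 / 3] * n_two_thirds + [0.0] * n_zero
    assert len(scores) == 20
    return np.array(scores)

# One combination per model that is consistent with the reported numbers
candidates = {
    'Logistic Regression': fold_scores(17, 2, 1),
    'Random Forest': fold_scores(17, 3, 0),
    'XGBoost': fold_scores(16, 3, 1),
}

# Mean and (population) standard deviation, as computed by accuracies.mean() / accuracies.std()
summary = {name: (s.mean(), s.std()) for name, s in candidates.items()}
for name, (mean, std) in summary.items():
    print(name, mean, std)
```

Under this reading, the three models differ by only one or two misclassified battles across all folds, so the ranking between them should be treated with caution.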

Implementation - Extracting Feature Importance

In [76]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 20, criterion = 'entropy',random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 20)
display(accuracies.mean())
display(accuracies.std())
0.95
0.11902380714238082
In [77]:
# Extract the feature importances from the fitted random forest
importances = classifier.feature_importances_
vs.feature_plot(importances, X_train, y_train)
In [78]:
# Import functionality for cloning a model
from sklearn.base import clone

# Reduce the feature space to the five most important features
top_features = X_train.columns.values[np.argsort(importances)[::-1][:5]]
X_train_reduced = X_train[top_features]
X_test_reduced = X_test[top_features]

# Train a fresh copy of the random forest on the reduced feature set
clf = clone(classifier).fit(X_train_reduced, y_train)

# Make new predictions
reduced_predictions = clf.predict(X_test_reduced)

# Making the Confusion Matrix (using the reduced-model predictions)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, reduced_predictions)

# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = clf, X = X_train_reduced, y = y_train, cv = 20)
display(accuracies.mean())
display(accuracies.std())
0.9666666666666666
0.1
In [79]:
importances = clf.feature_importances_
vs.feature_plot(importances, X_train_reduced, y_train)

After reducing the features to the 5 most important predictors, accuracy improves slightly to 96.7% and the standard deviation decreases slightly to 0.1. We now have a model that can predict battle outcomes for Game of Thrones!

The most important factors are:

attacker_size - The size of the attacking army matters (unless you are Balon/Euron Greyjoy, who wins with smaller armies).
attacker_commander_count - The number of attacking commanders matters as well.
attack_houses - The number of attacking houses in the battle.
defender_size - The size of the defending army matters (unless you are Balon/Euron Greyjoy, who wins with smaller armies).
defender_houses - The number of defending houses in the battle.
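The top-five selection above relies on `np.argsort`, which returns indices in ascending order, so the result is reversed before slicing. The sketch below shows the same pattern on hypothetical importances. The values are made up for illustration and only the feature names come from the notebook.

```python
import numpy as np

feature_names = np.array(['year', 'attacker_size', 'summer', 'defender_size', 'attack_houses'])
importances = np.array([0.05, 0.40, 0.02, 0.30, 0.23])  # hypothetical importances

# argsort is ascending, so reverse it to rank from most to least important
order = np.argsort(importances)[::-1]
top_3 = feature_names[order[:3]]
print(list(top_3))
```

Random forest importances describe how much each feature reduced impurity in this particular fit. They do not say in which direction a feature pushes the outcome.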
